Abstract

We suggest that the curse of dimensionality affecting similarity-based search in large
datasets is a manifestation of the phenomenon of concentration of measure on
high-dimensional structures. We prove that, under certain geometric assumptions on the
query domain $\Omega$ and the dataset $X$, if $\Omega$ satisfies the so-called concentration
property, then for most query points $x^*$ the ball of radius $(1+\varepsilon)d_X(x^*)$ centred
at $x^*$ contains either all points of $X$ or else at least $C_1\exp(C_2\varepsilon^2 n)$ of them.
Here $d_X(x^*)$ is the distance from $x^*$ to the nearest neighbour in $X$ and $n$ is the
dimension of $\Omega$.

1. Introduction

As the size of datasets in existence grows at an amazing rate (see e.g. Section 4.1 in
[?]) and workloads become ever more sophisticated, algorithms for similarity-based
data retrieval often slow down exponentially with dimension, sometimes reducing to
an exhaustive search ('the curse of dimensionality') [?]. It is important to try
and understand the common geometric nature of the dimensionality curse for a great
variety of different, often non-Euclidean, metric spaces representing data structures
[?].

In this Letter we suggest that the curse of dimensionality is a manifestation of the
phenomenon of concentration of measure on high-dimensional structures.

This phenomenon is an important discovery of modern analysis, observed in a
wide range of situations [?]. Roughly speaking, a set $\Omega$ equipped with
a distance and a probability measure has the concentration property if already for
small values of $\varepsilon > 0$ the '$\varepsilon$-fattening' of every subset containing at least 1/2 of all
elements of $\Omega$ contains all points of $\Omega$ apart from a set of almost vanishing measure
$\alpha(\varepsilon)$. Here $\alpha$ is the so-called concentration function of $\Omega$. Many 'naturally occurring'
high-dimensional structures possess the concentration property: the $n$-dimensional
sphere $S^n$, Euclidean unit ball $B^n$, Hamming cube $\{0,1\}^n$, groups of permutations $S_n$
all have concentration functions of the form $\alpha(\varepsilon) = O(1)\exp(-O(1)\varepsilon^2 n)$.

Here we will address just one aspect of the 'dimensionality curse,' informally described
in [?] as follows:

'It seems ... that this [exponential] complexity might be inherent in any algorithm 
for solving closest point problems because a point in a high-dimensional space 
can have many "close" neighbours.' 

To formalise this account, we borrow a concept from [?]. A similarity query is called
$\varepsilon$-unstable for an $\varepsilon > 0$ if most points of the dataset $X$ are at a distance
$< (1+\varepsilon)d_X(x^*)$ from the query point $x^*$, where $d_X$ denotes the distance to the
nearest neighbour in the dataset $X$. Query instability was shown in [?] to occur under
some probability assumptions on the query distribution, and it was argued that asking
unstable queries is partly responsible for the dimensionality curse. It seems to us that
even more important is a 'local' version of query instability, where the number of data
points located at a distance $< (1+\varepsilon)d_X(x^*)$ from a query point $x^*$ grows exponentially
in the dimension of the query domain. If such an effect prevails in a given workload,
then answering the range query of radius $(1+\varepsilon)d_X(x^*)$ takes exponential expected
time, even though the query may be globally stable.

In our model, a dataset $X$ is a finite metric subspace of a metric space $\Omega$ of query
points, the latter being equipped with a probability measure reflecting the query
distribution. We assume that $\Omega$ has the concentration property in the sense that
$\alpha(\varepsilon) = O(1)\exp(-O(1)\varepsilon^2 n)$, where $n$ is the 'dimension' of the query domain $\Omega$. Our
assumption on the way $X$ sits in $\Omega$ is of a homogeneity type: the radii of open balls
centred at $x$ and having measure 1/2 are (almost) the same for all data points $x$.

Under such assumptions we prove that if $\varepsilon > 0$, then for all query points $x^* \in \Omega$,
apart from a set of measure $O(1)\exp(-O(1)\varepsilon^2 n)$, the open ball of radius $(1+\varepsilon)d_X(x^*)$
centred at $x^*$ contains either all points of the dataset $X$ or else at least $C_1\exp(C_2\varepsilon^2 n)$
of them for some $C_1, C_2 > 0$. Thus, a typical range query of radius $(1+\varepsilon)d_X(x^*)$ is
either unstable or takes an exponential time to answer. In particular, most queries
are unstable if the size of $X$ grows subexponentially in $n$.

In the Conclusion we explain a possible constructive significance of our results.



2. Similarity workloads 

Our model builds on the approaches of [?], [?] and [?]. A similarity workload is a
quadruple $(\Omega, d, \mu, X)$, where

1. $\Omega$ is a (possibly infinite) set called the domain, whose elements are query points.

2. $d$ is a metric on $\Omega$, the dissimilarity measure.

3. $\mu$ is a Borel probability measure on the metric space $(\Omega, d)$, reflecting the query
point distribution.

4. $X$ is a finite subset of $\Omega$, called the instance, or the dataset proper, whose
elements are data points.

Recall that a triple $(\Omega, d, \mu)$ formed by a metric space $(\Omega, d)$ and a probability Borel
measure $\mu$ on it is called a probability metric space. Thus, a similarity workload is a
probability metric space $\Omega$ together with a distinguished finite metric subspace $X$.

Similarity queries are of two major types: a range query centred at $x^* \in \Omega$ of radius
$\varepsilon > 0$ (the set of all $x \in X$ with $d(x, x^*) < \varepsilon$), and a $k$-nearest neighbours ($k$-NN)
query centred at $x^* \in \Omega$, where $k \in \mathbb{N}$.
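The two query types can be sketched by brute force as follows. This is only an illustration under stated assumptions: the function names `range_query` and `knn_query`, the choice of binary strings as the domain, and the normalised Hamming distance as the metric are all hypothetical and not taken from the text.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def range_query(X: Sequence[P], d: Callable[[P, P], float], q: P, radius: float) -> List[P]:
    """All data points strictly closer to q than `radius` (an open ball)."""
    return [x for x in X if d(x, q) < radius]


def knn_query(X: Sequence[P], d: Callable[[P, P], float], q: P, k: int) -> List[P]:
    """The k data points nearest to q, closest first (ties keep input order)."""
    return heapq.nsmallest(k, X, key=lambda x: d(x, q))


def hamming(s: str, t: str) -> float:
    """Normalised Hamming distance between equal-length binary strings."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


# Hypothetical toy dataset of 4-bit strings.
data = ["0000", "0001", "0111", "1111"]
print(range_query(data, hamming, "0000", 0.5))  # ['0000', '0001']
print(knn_query(data, hamming, "0011", 2))      # ['0001', '0111']
```

Both functions scan the whole dataset, which is exactly the exhaustive search that index structures try to avoid.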

Following [?] we say that a similarity query centred at $x^*$ is $\varepsilon$-unstable for an
$\varepsilon > 0$ if

$$|\{x \in X : d(x^*, x) < (1+\varepsilon)d_X(x^*)\}| > \frac{|X|}{2},$$

where $d_X(x^*) = \min\{d(x^*, x) : x \in X\}$ is the distance from $x^*$ to the nearest neighbour
in $X$. We propose the following new type of query.

• An $\varepsilon$-radius nearest neighbours query centred at a point $x^* \in \Omega$, where $\varepsilon > 0$, is
a range query centred at $x^*$ of radius $(1+\varepsilon)d_X(x^*)$.
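A minimal sketch of the $\varepsilon$-radius nearest neighbours query and of the instability test is given below. It assumes that "most points" means more than $|X|/2$ of them; the helper names and the one-dimensional toy data are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def nn_distance(X: Sequence[P], d: Callable[[P, P], float], q: P) -> float:
    """d_X(q): the distance from q to its nearest neighbour in X."""
    return min(d(x, q) for x in X)


def eps_radius_nn_query(X: Sequence[P], d: Callable[[P, P], float], q: P, eps: float) -> List[P]:
    """Range query of radius (1 + eps) * d_X(q) centred at q (open ball)."""
    radius = (1.0 + eps) * nn_distance(X, d, q)
    return [x for x in X if d(x, q) < radius]


def is_unstable(X: Sequence[P], d: Callable[[P, P], float], q: P, eps: float) -> bool:
    """True if more than half of X lies in the eps-radius NN ball (assumed threshold)."""
    return len(eps_radius_nn_query(X, d, q, eps)) > len(X) / 2


def absdiff(a: float, b: float) -> float:
    return abs(a - b)


# Hypothetical toy workload on the real line; here d_X(0.0) = 1.0.
X = [1.0, 2.0, 2.5, 10.0]
print(eps_radius_nn_query(X, absdiff, 0.0, 1.0), is_unstable(X, absdiff, 0.0, 1.0))
print(eps_radius_nn_query(X, absdiff, 0.0, 2.0), is_unstable(X, absdiff, 0.0, 2.0))
```

Because the inequality is strict, a query point that itself belongs to the dataset has $d_X(x^*) = 0$ and receives an empty answer; the sketch keeps that behaviour rather than special-casing it.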

3. The concentration phenomenon 

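The concentration function, $\alpha = \alpha_\Omega$, of a probability metric space $\Omega$ is defined for
each $\varepsilon > 0$ by

$$\alpha(\varepsilon) = 1 - \inf\{\mu(\mathcal{O}_\varepsilon(A)) : A \subseteq \Omega \text{ is Borel and } \mu(A) \ge 1/2\} \qquad (3.1)$$

and $\alpha_\Omega(0) = 1/2$, where $\mathcal{O}_\varepsilon(A)$ denotes the open $\varepsilon$-neighbourhood of $A$. It is a
decreasing function of $\varepsilon$.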
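For a very small finite space, definition (3.1) can be evaluated by exhaustive search. The sketch below assumes the normalised counting measure and uses the 3-dimensional Hamming cube as a toy domain; the function name `concentration_function` is hypothetical, and the approach is feasible only for a handful of points.

```python
from itertools import combinations, product


def concentration_function(points, d, eps):
    """alpha(eps) = 1 - min mu(O_eps(A)) over subsets A with mu(A) >= 1/2,
    for a finite space carrying the normalised counting measure."""
    n_pts = len(points)
    if eps == 0:
        return 0.5  # convention alpha(0) = 1/2
    half = (n_pts + 1) // 2  # smallest subset size with mu(A) >= 1/2
    # O_eps(A) grows with A, so the minimum is attained on subsets of size `half`.
    best = n_pts
    for A in combinations(range(n_pts), half):
        fattened = sum(
            1
            for i in range(n_pts)
            if any(d(points[i], points[j]) < eps for j in A)
        )
        best = min(best, fattened)
    return 1 - best / n_pts


def hamming(s, t):
    """Normalised Hamming distance between equal-length tuples."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


cube = list(product((0, 1), repeat=3))
for eps in (0.2, 0.5, 0.9):
    print(eps, concentration_function(cube, hamming, eps))
# prints 0.5, 0.125 and 0.0 for the three values of eps
```

At $\varepsilon = 0.5$ the minimising sets are Hamming balls of radius 1, whose fattening misses a single point of the cube, which gives $\alpha(0.5) = 1/8$.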

A family $(\Omega_n)_{n=1}^{\infty}$ of probability metric spaces is called a Lévy family if for each
$\varepsilon > 0$, $\alpha_{\Omega_n}(\varepsilon) \to 0$ as $n \to \infty$, and a normal Lévy family (with constants $C_1, C_2 > 0$)
if for all $n$ and $\varepsilon > 0$

$$\alpha_{\Omega_n}(\varepsilon) \le C_1 e^{-C_2\varepsilon^2 n}.$$

All the families listed below are normal Lévy families; see [?] for exact values
of constants and further examples.

Examples 3.1. (1) The $n$-dimensional unit spheres $S^n$ equipped with the (unique)
rotation-invariant probability measure and the geodesic distance. (2) The same, with
the Euclidean distance. (3) The Hamming cubes $\{0,1\}^n$ of all binary strings of length
$n$, equipped with the normalised Hamming distance $d(s,t) = \frac{1}{n}|\{i : s_i \ne t_i\}|$ and the
normalised counting measure $\mu_\#(A) = |A|/2^n$. (4) The groups $SO(n)$ of $n \times n$
orthogonal matrices with determinant 1, equipped with the geodesic distance and the
Haar measure. (5) The Euclidean balls $B^n$ with the $n$-volume and Euclidean distance.
(6) The tori $T^n$ with the normalised geodesic distance and product measure. (7) The
hypercubes $[0,1]^n$ with the normalised Euclidean (or $l_1$) distance.

More Lévy families can be obtained using operations described in [?], Sect. 2.
Let $f : \Omega \to \mathbb{R}$ be a Lipschitz-1 function:

$$\forall x, y \in \Omega, \quad |f(x) - f(y)| \le d(x, y).$$

Denote by $M$ a median (or Lévy mean) of $f$, that is, a real number with

$$\mu(\{x \in \Omega : f(x) \le M\}) = \mu(\{x \in \Omega : f(x) \ge M\}).$$

Proposition 3.2. For every $\varepsilon > 0$, $\mu(f^{-1}(M - \varepsilon, M + \varepsilon)) \ge 1 - 2\alpha(\varepsilon)$. □
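The effect described by Proposition 3.2 can be observed exactly on the Hamming cube. The sketch below takes, as an illustrative assumption, the function $f(s) = $ (number of ones in $s$)$/n$, which is Lipschitz-1 for the normalised Hamming distance and has median $1/2$ by symmetry, and computes the measure of $f^{-1}(M - \varepsilon, M + \varepsilon)$ from binomial coefficients. The function name `mass_near_median` is hypothetical.

```python
from fractions import Fraction
from math import comb


def mass_near_median(n: int, eps: Fraction) -> float:
    """mu{ s in {0,1}^n : |f(s) - 1/2| < eps } for f(s) = (number of ones) / n,
    computed exactly under the normalised counting measure."""
    half = Fraction(1, 2)
    favourable = sum(
        comb(n, k) for k in range(n + 1) if abs(Fraction(k, n) - half) < eps
    )
    return float(Fraction(favourable, 2 ** n))


for n in (10, 100, 1000):
    print(n, mass_near_median(n, Fraction(1, 10)))
```

Exact rational arithmetic is used for the comparison so that boundary values such as $k/n = 0.4$ are not misclassified by floating-point rounding. For fixed $\varepsilon$ the printed mass approaches 1 as $n$ grows, which is the behaviour the proposition predicts for a normal Lévy family.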

The phenomenon of concentration of measure on high-dimensional structures refers
to the above situation, in which the function $f$ 'concentrates near one value.'
See [?].

4. Concentration and similarity workloads

Let $\Omega$ be a probability metric space with the concentration function $\alpha = \alpha_\Omega$. The
following is quite immediate.

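Lemma 4.1. Let $A \subseteq \Omega$, $\delta > 0$, and $\mu(A) > \alpha(\delta)$. Then $\mu(\mathcal{O}_\delta(A)) > 1/2$. □

Lemma 4.2. Let $\delta > 0$, and let $\gamma$ be a collection of subsets $A \subseteq \Omega$ of measure
$\mu(A) \le \alpha(\delta)$ each, satisfying $\mu(\cup\gamma) \ge 1/2$. Then the $2\delta$-neighbourhood of every point
$x \in \Omega$, apart from a set of measure at most $\frac{1}{2}\alpha(\delta)^{1/2}$, meets at least
$\lceil \frac{1}{2}\alpha(\delta)^{-1/2} \rceil$ elements of $\gamma$.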

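Proof. Partition $\gamma$ into a collection of pairwise disjoint subfamilies $\gamma_i$, $i \in I$, in such a
way that for every $i$, $\alpha(\delta) \le \mu(A_i) \le 2\alpha(\delta)$, where $A_i = \cup\gamma_i$. Clearly,
$(1/4)\alpha(\delta)^{-1} \le |I| \le (1/2)\alpha(\delta)^{-1}$. Select a subset $J \subseteq I$ with
$|J| = \lceil \frac{1}{2}\alpha(\delta)^{-1/2} \rceil$. Lemma 4.1 implies that

$$\mu(\mathcal{O}_{2\delta}(A_i)) \ge \mu(\mathcal{O}_\delta(\mathcal{O}_\delta(A_i))) \ge 1 - \alpha(\delta),$$

and therefore $\cap_{i \in J}\mathcal{O}_{2\delta}(A_i)$ has measure at least $1 - |J|\alpha(\delta)$. □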

Let $(\Omega, d, \mu, X)$ be a similarity workload, with $\alpha$ as above. Denote by $M$ a median
value of the function $d_X$ (distance to $X$) on $\Omega$.

Lemma 4.3. Let $\delta > 0$. Then for all points $x^* \in \Omega$, except for a set of total mass at
most $2\alpha(\delta)$, the distance to the nearest neighbour in $X$ is in the interval $(M - \delta, M + \delta)$.

Proof. The function $x^* \mapsto d_X(x^*)$ is Lipschitz-1 on $\Omega$, and Prop. 3.2 applies. □
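Lemma 4.3 can be illustrated by simulation. The sketch below is a Monte Carlo experiment under assumptions of our own choosing: the domain is the Hamming cube $\{0,1\}^n$ with the normalised Hamming distance, both data points and queries are drawn uniformly at random, and the median $M$ is replaced by its empirical estimate over the sampled queries. The function name and parameters are hypothetical.

```python
import random
import statistics


def nn_distance_spread(n: int, n_data: int, n_queries: int, delta: float, seed: int = 0) -> float:
    """Fraction of uniformly random queries in {0,1}^n whose nearest-neighbour
    distance lies within delta of the empirical median of those distances."""
    rng = random.Random(seed)
    data = [rng.getrandbits(n) for _ in range(n_data)]
    dists = []
    for _ in range(n_queries):
        q = rng.getrandbits(n)
        # Normalised Hamming distance via XOR followed by a bit count.
        dists.append(min(bin(q ^ x).count("1") for x in data) / n)
    m = statistics.median(dists)
    return sum(abs(t - m) < delta for t in dists) / n_queries


for n in (16, 64, 256):
    print(n, nn_distance_spread(n, n_data=50, n_queries=200, delta=0.05))
```

With the dataset size held fixed, one would expect the printed fraction to approach 1 as $n$ grows, since the nearest-neighbour distance is a Lipschitz-1 function on a normal Lévy family.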

Definition 4.4. Let $(\Omega, d, \mu, X)$ be a similarity workload. For an $x \in X$, denote by
$R_x$ the maximal radius of an open ball in $\Omega$ centred at $x$ of measure $\le 1/2$. Let
$\varepsilon > 0$. We say that $X$ is weakly $\varepsilon$-homogeneous in $\Omega$ if all radii $R_x$, $x \in X$, belong to
an interval of length $< \varepsilon$.

Examples 4.5. (1) $X$ is weakly $\varepsilon$-homogeneous for every $\varepsilon > 0$ if the group of motions
preserving the measure acts transitively on $\Omega$. Such are spaces 1-4, 6 in Examples
3.1. (2) A subspace $X$ of the ball $B^n$ is weakly $\varepsilon$-homogeneous if $X$ is contained
in a spherical shell of thickness $\varepsilon$. (3) If we independently throw in $\Omega$ $N$ points
$x_1, x_2, \ldots, x_N$, distributed with respect to the measure $\mu$, then one can show that, with
probability $> 1 - 2N\alpha(\varepsilon/2)$, the dataset $X = \{x_1, \ldots, x_N\}$ is weakly $\varepsilon$-homogeneous.
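For a finite domain with the normalised counting measure, the radii $R_x$ of Definition 4.4 and the weak homogeneity condition can be checked directly. The following sketch makes that assumption; the function names and the two toy domains (the 3-dimensional Hamming cube and five points on a line) are hypothetical.

```python
from itertools import product


def half_mass_radius(points, d, x):
    """R_x: the largest r such that the open ball O_r(x) has measure <= 1/2,
    for a finite space carrying the normalised counting measure."""
    dists = sorted(d(x, y) for y in points)
    # O_r(x) holds the points at distance < r. With r = dists[k], k = floor(N/2),
    # at most k points qualify; any larger r admits at least k + 1 > N/2 points.
    return dists[len(points) // 2]


def is_weakly_homogeneous(points, d, X, eps):
    """True if all radii R_x, x in X, fit into an interval of length < eps."""
    radii = [half_mass_radius(points, d, x) for x in X]
    return max(radii) - min(radii) < eps


def hamming(s, t):
    return sum(a != b for a, b in zip(s, t)) / len(s)


def absdiff(a, b):
    return abs(a - b)


cube = list(product((0, 1), repeat=3))
print(is_weakly_homogeneous(cube, hamming, [(0, 0, 0), (1, 1, 1), (0, 1, 0)], 0.01))  # True

line = [0, 1, 2, 3, 10]
print(half_mass_radius(line, absdiff, 0), half_mass_radius(line, absdiff, 10))  # 2 8
print(is_weakly_homogeneous(line, absdiff, [0, 10], 1))  # False
```

The cube is homogeneous in the sense of Example 4.5 (1), so every $R_x$ coincides; the outlier at 10 on the line sees the rest of the domain from far away and therefore has a much larger half-mass radius.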

5. Query instability: local and global 

Theorem 5.1. Let $(\Omega, d, \mu, X)$ be a similarity workload. Denote by $M$ a median
value of the distance from a query point in $\Omega$ to its nearest neighbour in $X$. Let
$0 < \varepsilon < 1$, and assume that the instance $X$ is weakly $(M\varepsilon/6)$-homogeneous in $\Omega$.

Then for all points $x^* \in \Omega$, apart from a set of total mass at most $3\alpha(M\varepsilon/6)^{1/2}$, the
open ball of radius $(1+\varepsilon)d_X(x^*)$ centred at $x^*$ contains at least

$$\min\{|X|, \lceil \tfrac{1}{2}\alpha(M\varepsilon/6)^{-1/2} \rceil\}$$

elements of $X$.

Proof. Denote by $R$ the minimum of the radii $R_x$, $x \in X$. Let $\Delta = R - M$.

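(1) If $\Delta \ge M\varepsilon/6$, then by Lemma 4.1 the measure of the ball $\mathcal{O}_M(x)$ cannot exceed
$\alpha(\Delta) \le \alpha(M\varepsilon/6)$, for otherwise the measure of $\mathcal{O}_R(x)$ would be $> 1/2$. In particular,
$|X| \ge \frac{1}{2}\alpha(M\varepsilon/6)^{-1}$. According to Lemma 4.2 applied to the balls $\mathcal{O}_M(x)$, $x \in X$,
with $\delta = M\varepsilon/6$, for all $x^* \in \Omega$ apart from a set of measure $\le \frac{1}{2}\alpha(M\varepsilon/6)^{1/2}$, the
$(M\varepsilon/3)$-neighbourhood of $x^*$ meets at least $\lceil \frac{1}{2}\alpha(M\varepsilon/6)^{-1/2} \rceil$ of such balls.

(2) If $\Delta < M\varepsilon/6$ (in particular, if $|X| < \frac{1}{2}\alpha(M\varepsilon/6)^{-1/2} \le \frac{1}{2}\alpha(M\varepsilon/6)^{-1}$, cf. the
previous paragraph), then $R_x + M\varepsilon/6 < M(1 + \varepsilon/2)$. Denote by $X'$ a subset of $X$ of
cardinality $\min\{|X|, \lceil \frac{1}{2}\alpha(M\varepsilon/6)^{-1/2} \rceil\}$. Since the measure of every ball $\mathcal{O}_{M(1+\varepsilon/2)}(x)$ is
at least $1 - \alpha(M\varepsilon/6)$, for all $x^* \in \Omega$ apart from a set of measure $\le \frac{1}{2}\alpha(M\varepsilon/6)^{1/2}$, the
$(M\varepsilon/2)$-neighbourhood of $x^*$ meets every ball $\mathcal{O}_M(x)$, $x \in X'$.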

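As a consequence of Lemma 4.3 with $\delta = M\varepsilon/4$, for all $x^* \in \Omega$ apart from a set of
measure at most $2\alpha(M\varepsilon/4)$, one has $|d_X(x^*) - M| < M\varepsilon/4$ and therefore
$M(1 + \varepsilon/2) < d_X(x^*)(1 + \varepsilon)$. It remains to notice that
$\frac{1}{2}\alpha(M\varepsilon/6)^{1/2} + 2\alpha(M\varepsilon/4) \le 3\alpha(M\varepsilon/6)^{1/2}$. □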
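To get a feeling for the sizes involved, the two quantities appearing in the proof, the exceptional mass $3\alpha(M\varepsilon/6)^{1/2}$ and the count $\min\{|X|, \lceil \frac{1}{2}\alpha(M\varepsilon/6)^{-1/2} \rceil\}$, can be evaluated numerically. The sketch below assumes a concentration function of the normal Lévy form $\alpha(t) = C_1\exp(-C_2 t^2 n)$ with hypothetical constants $C_1 = C_2 = 1$; the actual constants depend on the family and are not specified here. The function name `theorem_bounds` is likewise hypothetical.

```python
import math


def theorem_bounds(n, eps, M, size_X, C1=1.0, C2=1.0):
    """Evaluate the exceptional mass and the guaranteed neighbour count
    for the assumed concentration function alpha(t) = C1 * exp(-C2 * t**2 * n).

    Returns (exceptional_mass, guaranteed_count)."""
    t = M * eps / 6.0
    alpha = C1 * math.exp(-C2 * t * t * n)
    exceptional_mass = min(1.0, 3.0 * math.sqrt(alpha))
    if alpha == 0.0:
        # exp() underflowed: the count bound exceeds any finite dataset size.
        return exceptional_mass, size_X
    guaranteed_count = min(size_X, math.ceil(0.5 / math.sqrt(alpha)))
    return exceptional_mass, guaranteed_count


for n in (10**3, 10**4, 10**5):
    print(n, theorem_bounds(n, eps=0.5, M=1.0, size_X=10**9))
```

Since a true concentration function is only bounded above by the assumed expression, the first returned value is an upper estimate of the exceptional mass and the second a lower estimate of the count. The factor $1/6$ inside $\alpha$ means that the bounds become informative only for fairly large $n$, in line with the remark in the Conclusion that the estimates are not optimal.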

Asymptotic results. Let $(\Omega_n, d_n, \mu_n, X_n)$ be an infinite collection of workloads.
Denote by $M_n$ the median distances from points of $\Omega_n$ to their nearest neighbours in
$X_n$. We make the following standing assumptions.

(1) The query domains $(\Omega_n, d_n, \mu_n)$ form a normal Lévy family.

(2) The values $M_n$ are bounded away from zero: $M_n \ge M > 0$ for all $n \in \mathbb{N}$.

Remark 5.2. The latter condition is only violated in very densely populated domains.
For example, if $\Omega_n =$ [...] then (2) is satisfied whenever the size of $X_n$ is not
superexponential in $n$. For $\Omega_n$ finite, (2) is satisfied if [...] $< \alpha_{\Omega_n}(M) \cdot |\Omega_n|$.

Now let $0 < \varepsilon < 1$.

(3) All the instances $X_n$ are weakly $(M_n\varepsilon/6)$-homogeneous in $\Omega_n$.

Corollary 5.3. Under the assumptions (1)-(3), for all query points $x^* \in \Omega_n$ apart
from a set of measure $O(1)\exp(-O(1)M^2\varepsilon^2 n)$, the open ball of radius $(1+\varepsilon)d_X(x^*)$
centred at $x^*$ contains either all elements of $X_n$ or else at least $C_1\exp(C_2 M^2\varepsilon^2 n)$ of
them for some constants $C_1, C_2 > 0$ depending only on the family $(\Omega_n)_{n=1}^{\infty}$. □
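The exponents in Corollary 5.3 can be traced by substituting the normal Lévy bound into Theorem 5.1. In the worked step below, $C_1$ and $C_2$ denote the constants of the normal Lévy family from Section 3 (an assumption about labelling: the constants in the statement of the corollary are then $\frac{1}{2}C_1^{-1/2}$ and $C_2/72$), and the first inequality uses that $\alpha$ is decreasing together with $M_n \ge M$.

```latex
\begin{align*}
\alpha_{\Omega_n}\!\left(\tfrac{M_n\varepsilon}{6}\right)
  &\le \alpha_{\Omega_n}\!\left(\tfrac{M\varepsilon}{6}\right)
   \le C_1 \exp\!\left(-\tfrac{C_2}{36}\, M^2\varepsilon^2 n\right), \\
3\,\alpha_{\Omega_n}\!\left(\tfrac{M_n\varepsilon}{6}\right)^{1/2}
  &\le 3\sqrt{C_1}\, \exp\!\left(-\tfrac{C_2}{72}\, M^2\varepsilon^2 n\right), \\
\tfrac{1}{2}\,\alpha_{\Omega_n}\!\left(\tfrac{M_n\varepsilon}{6}\right)^{-1/2}
  &\ge \tfrac{1}{2\sqrt{C_1}}\, \exp\!\left(\tfrac{C_2}{72}\, M^2\varepsilon^2 n\right).
\end{align*}
```

The second line bounds the exceptional set and the third gives the exponentially growing number of data points in the ball whenever that number is smaller than $|X_n|$.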

Corollary 5.4. Under the assumptions (1)-(3), for all query points $x^*$, apart from a
set of measure $O(1)\exp(-O(1)M^2\varepsilon^2 n)$, the $\varepsilon$-radius nearest neighbours query centred
at $x^*$ either is unstable or takes an exponential time (in $n$) to answer. □

Corollary 5.5. In addition to (1)-(3), let the size of $X_n$ grow subexponentially in $n$.
Then for all query points $x^* \in \Omega_n$, apart from a set of measure $O(1)\exp(-O(1)M^2\varepsilon^2 n)$,
the similarity query centred at $x^*$ is $\varepsilon$-unstable: all points of $X_n$ are at a distance
$< (1+\varepsilon)d_X(x^*)$ from $x^*$. □

Example 5.6. It is easy to construct sequences of workloads in which most similarity
queries are 1-stable and yet for every $\varepsilon > 0$ most of the $\varepsilon$-radius NN queries take time
$C_1\exp(C_2\varepsilon^2 n)$ to answer.

Let $\delta > 0$ be arbitrary. In a probability metric space $\Omega$ choose a maximal subset $X$
with the property that every two different elements of $X$ are at a distance $\ge \delta$ from
each other. It is easy to see that centres of all 1-unstable similarity queries in the
workload $(\Omega, X)$ are contained in some ball of radius $4\delta$. Applying this procedure to
every member of a normal Lévy family of homogeneous spaces of constant diameter $D$
(a typical situation) and choosing $\delta < D/8$, we obtain a desired sequence of workloads,
because one can then prove that $\liminf M_n \ge \delta/2$.
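In a finite domain, a maximal subset with pairwise distances $\ge \delta$ can be produced by a single greedy pass. The sketch below is one such construction, not necessarily the one the author has in mind; the function name and the choice of the 4-dimensional Hamming cube with $\delta = 1/2$ are hypothetical.

```python
from itertools import product


def maximal_separated_subset(points, d, delta):
    """Greedy scan: keep a point iff it is at distance >= delta from every point
    kept so far. The result is delta-separated and cannot be enlarged within
    `points`, because every rejected point is closer than delta to a kept one."""
    chosen = []
    for p in points:
        if all(d(p, c) >= delta for c in chosen):
            chosen.append(p)
    return chosen


def hamming(s, t):
    """Normalised Hamming distance between equal-length tuples."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


cube = list(product((0, 1), repeat=4))
X = maximal_separated_subset(cube, hamming, 0.5)
print(len(X), X)
```

Maximality is what makes the dataset cover the domain: every point of the domain lies at distance $< \delta$ from some element of $X$. The subset found depends on the scan order, so different orders may give different maximal subsets.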

6. Conclusion 

Our model links the 'curse of dimensionality' in multidimensional datasets to the 
phenomenon of concentration of measure on high-dimensional structures. All our 
assumptions on the query domain Q and the dataset X are purely geometric. Our 
estimates are by no means optimal, as we just aimed at deriving exponential lower 
bounds in a wide variety of situations. We believe that the most general case (absence
of homogeneity in any form) can be included in the picture as well and will address
the issue in future work. Other important directions for research are to apply the
concentration phenomenon to indexability theory [?] and to performance analysis of
concrete hierarchical tree index structures [?].



A possible constructive significance of our results is as follows. In practice,
geometrically optimal dissimilarity measures are being routinely replaced with less precise
distances that are computationally cheaper, with a view to subsequently discarding
false hits. Such distances would in general lead to sharper concentration effects on
the same measure space. It is therefore conceivable that using computationally more
expensive distances will result in an overall speed-up.